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Abstract — We study non-adaptive pooling strategies for detec- 
tion of rare faulty items. Given a binary sparse TV-dimensional 
signal X, how to construct a sparse binary M x A*' pooling 
matrix F such that the signal can be reconstructed from the 
smallest possible number M of measurements y = Fx? We show 
that a very low number of measurements is possible for random 
spatially coupled design of pools F. Our design might find 
application in genetic screening or compressed genotyping. We 
show that our results are robust with respect to the uncertainty 
in the matrix F when some elements are mistaken. 

I. Introduction 

Group testing HI, ||2l, also known as pooling in molecular 
biology, is designed to reduce the number of tests required to 
identify rare faulty items. In the most naive setting each item is 
tested separately and the number of tests is equal to the number 
of items. If, however, only a small fraction of the items are 
faulty, then the number of tests can be decreased significantly 
by creating "pools", i.e. by including more than one item in 
one test and allowing each item to take part in several different 
tests. The main problem is how to design these pools such 
that their number is the smallest possible while allowing for 
a tractable, and robust to noise, reconstruction procedure. 

In adaptive group testing the new pools are designed using 
the results of previous pools. However, many experimental 
situations require a non-adaptive testing where all pools must 
be specified without knowing the outcomes of other pools. 
Mainly two kinds of tests are relevant for practical applica- 
tions. In Boolean group testing, each test outputs negative if 
it does not contain a faulty item and positive if it contains 
at least one faulty item. In linear group testing, each test 
outputs the number of faulty items. In practical applications 
the experimental constraints often require that the size of each 
pool is relatively small and one item does not belong to too 
many pools. In this paper we analyse the non-adaptive (single 
stage) pooling with linear tests and limited size of each pool. 

A number of recent works have discussed a close relation 
between non-adaptive pooling with linear tests and compressed 
sensing S, El, Q, ©, Q, IS], M, DSl- Here, we follow this 
direction of works and build on recent advances in compressed 
sensing that used spatially coupled design of measurements 
and permitted to decrease the number of measurements down 
to the information theoretical limit ITTl . llT2l . llT3l . 



Our results are meant to find applications in various cur- 
rently relevant problems e.g. genetic screening Hj or com- 
pressed genotyping |7|. In these applications, genes of several 
individuals are mixed together and one measures how many 
of the genes in a given pool are "faulty", e.g. related to certain 
genetic problem. This is usually done by introducing markers 
that attach to these faulty genes. The main source of noise 
in such experiments is that the marker does not attach, i.e. 
whereas one thought that a given gene belonged to a given 
pool, in reality it did not. We shall investigate the robustness 
of our results under this type of noise. 

II. Setting and related works 

The non-adaptive pooling for detection of rare faulty items 
that we investigate is defined as follows: Consider a sparse 
binary iV-dimensional vector x. Its components (items) are 
denoted hy Xi, i = 1, . . . , N. The vector is sparse: only pN of 
the components are Xi = 1 (faulty) and the others are Xi = 
(correct), with p ^ 1. A pool a is a subset of components 
a C {!,..., N}. We denote Fai = 1 if component i belongs 
to pool a and Fai = otherwise. The result of a pool/test 

Va = ^Xi = ^ FatXi (1) 

is the number of faulty items in the pool. The goal is to design 
a smallest possible number M of pools such that the vector 
X can be reconstructed in a tractable way from the results of 
these pools y = (yi, . . . , t/m)- In practice the result of a pool 
becomes often unreliable if the pool contains many items and 
also if one item belongs to many pools, because the sample 
corresponding to one item then needs to be split into many 
small pieces. We are hence interested in the case where the 
size of every pool Ka is small (compared to N) and every 
item belongs to only a small number Li of pools. 

This problem is reminiscent of compressed sensing |14|, 
1 15 1 which is designed to measure signals directly in their 
compressed form. In fact our problem is compressed sensing 
with the additional constraint that the signal is binary and with 
a sparse binary measurement matrix Fai- We also consider 
matrix uncertainty: some elements assumed to be Fai = 1 are 
in fact Fai — with probability p. The field of low-density 
parity check (LDPC) error correcting codes fT6l provides 



information about sparse measurement matrices with which 
tractable reconstruction can be achieved. Indeed, the only 
difference between non-adaptive group testing with linear tests 
and LPDC is that the algebra is over integers in group testing 
instead of GF{2) in LDPC. The spatially coupled pooling 
design we study here was first discovered and validated in the 
field of error correcting codes lUTl, Qll, Ull, ll20l . 

The reconstruction algorithm that is most commonly used 
in compressed sensing and that has been also discussed several 
times for reconstruction in group testing and pooling experi- 
ments |3l, [^1 is based on a linear relaxation of the problem 
to real signal components < a;^ < 1. One then minimizes 
the ^i-norm of the signal under the constraints y = Fx. This 
is a convex problem that can be solved efficiently using linear 
programing. In what follows we will use the £i reconstruction 
as a reference benchmark to demonstrate the improvement 
that can be achieved using our pooling design and the belief 
propagation based reconstruction. 

The problem of non-adaptive group testing with linear tests, 
but with no constraints on the sparsity of the matrix F, 
is also known as the coin weighting problem 11271 . 1221 . A 
detailed review of previous results can be found in ||2T1 . It 
was shown fSS^ that, in the limit that interests us where 
N — > oo, reconstruction is not possible with less than 
M — (2 log 2)Af/ log iV tests. On the other hand a successful 
deterministic construction of the measurement matrix, together 
with a polynomial reconstruction algorithm, was found with 
M = (21og2)iV[l + o(l)]/logiV l24l measurements. As far 
as we know, however, the problem with a small (constant) 
number of items in every pool, as treated here, is still open. 

Note that when the number of faulty items R is much 
smaller than N, then another line of works should be con- 
sidered. The best polynomial time non-adaptive algorithms 
known for the coin weighting problem then need M = 
R log2 N measurements |25 1. Our approach is thus useful only 
in regimes where the number of faulty items is larger than 
R > 2{\og1fN/\o^N. Another problem that is closely 
related to non-adaptive pooling as considered here is the sparse 
code division multiple access (CDMA) method ll26l . 1271 ; 
with the difference that the signal x is not sparse and that 
there is usually a considerable Gaussian additive noise on the 
measurement vector y. The goal in CDMA is not to minimize 
the number of measurements M (the number of chips), but to 
support the largest possible amount of noise. Spatial coupling 
was investigated for (dense) CDMA in |28|, |29|. 

III. Spatially coupled design of pools 

The first design of pools (2?^) that we consider is based on 
a random construction. Each pool a contains K items. Each 
item i belongs to L pools. The assignment of items into pools 
is chosen uniformly out of all such possible ones. An example 
of a such random pooling matrix is plotted in Fig. [Tjleft. 

The second design of pools (I?s), for which the total 
number of tests needed for successful reconstruction will be 
considerably smaller, is the "seeded" or "spatially-coupled" 
design, illustrated in Fig. [T] right. First, we divide the N items 



into B equally sized blocks. Pools are also divided into B 
blocks, the first of them (the "seed") being larger than the 
others. We will fix to L the degree of all items, to Kg the 
degree of pools in the first block, and to Kf the degrees 
of pools in the other B — 1 blocks. The size of the first 
block is then Ms = N L/{KsB), and that of the other blocks 
Mf ^NL/{KfB). The overall under-sampling ratio is then 
a^L/{KsB) + {B-l)L/{KfB). When deciding connections 
between items and pools, we first connect randomly each pool 
to items in the block with the same index. Then we apply 
the following rewiring procedure: with probability J for each 
edge (i.e. connection between an item and a pool), we chose 
randomly another edge whose item is in one of the w previous 
blocks (and that has not been rewired yet) and switch the two 
edges. The details of this rewiring procedure do not change 
our results as long as a fraction of about J of new connections 
is created up to distance w. 




Fig. 1 . Left: Random pooling design, each blue point corresponds to an item 
i belonging to a pool a. We took A' = 2000 items, M = 700 pools, each item 
participates in L = 7 pools, and each pool has K = 20 items. Right: Seeded 
(or spatially coupled) design of pools, with N = 2000 items, M = 490 pools, 
B = 10 blocks with Ks = 20, Kf=30, J = 0.2 and w = 2. 



IV. Upper and lower bounds for number of pools 

A simple lower bound on the number of necessary measure- 
ments M can be obtained as follows: To reconstruct exactly 
the iV-component signal of R — pN non-zero components 
each possible configuration of the signal x should correspond 
to a distinct result of the pools y. In other words the number 
of possible outcomes of the measurements must be larger than 
the number of possible signals. If K denotes the number of 
items in each test this gives (^) < [K + 1)^^. In the limit 
of large systems, and constant density of faulty items p this 
gives aLB — H{p) / \og{K + 1), where we use the entropy 
function H{p) = — p log p — {1~ p) log(l — p). Note that this 
lower bound holds also for adaptive pooling design. 

We also derived a first moment upper bound on the critical 
ratio above which reconstruction is in principle possible for 
the random design iVr). Calling N{e) the number of vectors 
X compatible with the measurements y at distance 1 — e from 
the original signal, one obtains: 

E(7V(e)) = gW*(p,e) ^ gA'[alogP(e,A-,p)+ff(e,p)]^ ^2) 

where iJ(e,/9) = — eplogep — 2(1 — e)plog (1 — e)/? 

-(1 - 2p + ep) log (1 - 2p + ep) - H{p), 



and P{e,K,p) - £ 



K\ [1 - 2p(l - 6)]^^'-^" [(1 - e)p] 



]2a 



a=0 



(a!)2(if -2a)! 



When ^{p, e) is negative ever3rwhere except in the vicinity of 
e = 1 then the linear system y = Fx has only one solution 
that corresponds to the signal to be reconstructed. This leads 
to a threshold value auB above which reconstruction is in 
principle possible. As we will see this upper bound is very 
close to the actual threshold. With K — > oo, one can show 



For dense matrices F with K = N, this 



that auB - -logx- 
is a tight bound, as mentioned in the context of coin weighting. 

V. Signal recovery algorithms 

Knowing the set of tests or pools, i.e. the binary matrix 
F, and the measurement results y, one wants to reconstruct 
the signal. In compressed sensing the algorithms that achieve 
exact reconstruction with a smallest possible number of mea- 
surement are based on Bayesian inference, see 1,1 U . I,12J . We 
thus adopt the same strategy here. 

The probability distribution of observing measurements y 
given the signal x and set of pools F is given as 



P{y\x,F) = f[(^2''y^ 



(3) 



To estimate P{x\y,F), with the use of Bayes rule 
P{x\y,F) = P{y\x,F)P{x\F)/P{y\F), we need to assume 
some knowledge of statistical properties of the signal x. 
Denoting the fraction of non-zero elements p we use a prior 



N 



P (x) = l[[il - p)5{x0 + pSix, - 1)] 



(4) 



In compressed sensing it was argued |12| that for pooling 
matrices with random elements this assumption works as well 
for iid signal components as for correlated signals. Of course 
if an additional knowledge about the correlations in the signal 
were available it could be exploited and the performance 
improved further. We shall, however, not assume any such 
additional knowledge. The value of Xi that minimizes the 
number of errors is obtained as the value that is more probable 
according to the marginal distribution for that element 



(5) 



To estimate these marginals we use the canonical belief 
propagation (BP) algorithm |30|. Following the usual deriva- 
tion we introduce for each non-zero matrix element Fai two 
messages, and which are two-component vectors 

(normalized to XcT^" + Xi^" = 1 ™d V'o"*'* +'01^* = 1)> ™d 
we write iterative update equations for these as 



1 



i\a 
K-Xi 

E 



B + x^ 

Va 



B=ya-Xi 

E n rx 



(7) 



In the iterative BP algorithm we initialize ijji and xi as random 
numbers from the interval (0,1) and update equations (|6]|7]i till 
convergence. Note that the argument in equation (j?]) depends 
only on the sum of variables, it can thus be updated with 
the use of a convolution in {Ka — 1)^/2 steps (compared to 
the naive 2^"^^ steps). Once convergence is reached the BP 
estimates of the marginal probabilities are computed by: 



rxi 



(8) 



The BP inference of the item i is then x* = 1 is Xi > Xa 
X* = otherwise. 

The iteration of message updates, starting from random 
messages, is the BP algorithm. It turns out that BP can 
also be used to analyse the optimal Bayes inference. For the 
purpose of this analysis one initializes BP messages to values 
corresponding to the true signal, and iterates the equations 
until convergence. In some region of parameters this will 
reach a different fixed point than the iterations of randomly 
initialized messages. The fixed point that corresponds to the 
result that would be achieved by the (exponentially costly) 
optimal Bayesian inference procedure is the one having the 
largest log-likelihood $. The BP estimate of the log-likelihood, 
also called Bethe free energy, is given by 



$ - ^logZ,-^(A-,~l)logZ,, 

i a 

^ p Y[ vr^ + (1 - p) n ^0^' > 



(9) 



K 



B=ya {xj}ji:aa jeda 

In most practical situations the fraction of "faulty" items p 
is not known in advance. In such cases it can be learnt via 
expectation maximization learning, by iterating the following 
expression p„ew — Y^f=i Xil^^ where the r.h.s. is evaluated 
using the previously estimated value of p. 

VI. Performance and phase diagrams 
A. Noiseless case 

Let us first investigate reconstruction of the signal for the 
random design Vr- We are interested in the smallest possible 
ratio ac = M^/N for which exact reconstruction is still 
possible in the large N limit. With the BP algorithm, exact 
reconstruction is possible if and only if a > asp- The 
threshold (or "phase transition") asp is plotted in Fig. [3] 
We compare this value to the smallest possible ratio ag-^ for 
which the standard convex optimization approach, where one 
minimizes \x\e^ subject to y = Fx and < a;^ < 1, provides 
exact reconstruction. We see in Fig. [3] that is only slightly 
larger than asp for all range of L. 

Using the method explained in previous section, we have 
investigated the performance of Bayes optimal approach by 
evaluating the Bethe free energy $. BP messages initialized 



Fig. 2. Left: The limit of performance for BP (agp), l\ minimization 
and tlie Bayes optimal inference (cic) for random design of pools. The 
values are obtained as averages over 20 instances with A' = 10^ items and 
density of faulty items p = 0.1. Right: Fraction of exactly reconstructed 
signals using BP and £i reconstruction with random and seeded pools for 
N = 10*, L = 7 and p = 0.1. Data obtained from 100 random instances. 
The seeded matrix has B = 20 blocks, the first block (seed) has Ks = 20, 
and following blocks have Kj > 20 corresponding to different total values 
of a. Fraction J = 0.4 of links are connected to w = 2 previous blocks. 



on the true signal are fixed point in the absence of noise, 
corresponding to $ = 0. Randomly initialized BP reaches 
the same fixed point at large values of a, at asp, $ jumps 
discontinuously to negative values, meaning that there is no 
other signal with density p satisfying all the tests. The log- 
likelihood then grows as a decreases and becomes positive at 
ttc below which there are other signals of density p satisfying 
the tests and hence the true signal is undetectable. Above ac 
the exactly evaluated Bayes-optimal inference would hence be 
able to reconstruct exactly the signal. The value of ac is also 
plotted in Fig. |3] and we see that both £i and BP for random 
design of pools Vr are considerably suboptimal. 

Using seeded pooling design improves BP performance by 
moving the ratio a above which BP is able to reconstruct 
exactly the signal down to the Bayes-optimal threshold ac- 
Several works have argued that quite generically when the 
system size N, the number of block B and the interaction 
range w go to infinity (in this order) BP with spatially coupled 
design is able to saturate the threshold ac llSi . lfT9]| . ||20J . 
lim . lfT2l . This statement appUes also to our case. Here we 
investigate the performance of BP for seeded pooling design 
with realistic values of parameters N, B, w. In Fig. |2] right 
we plot the fraction of instances in which the signal was 
reconstructed exactly by BP and £i for the random pooling 
design Vr, and for the seeded design Vg, with a set of 
parameters specified in Table |lj Whereas £i performs basically 
the same for both designs, the performance of BP improves 
considerably in the seeded pooling scheme Vg, for realistic 
values of the parameters. In Fig. [3] we summarize all our 
results for the noiseless case with L = 7, and with a varying 
fraction of faulty items p. Let us also note that the Bayes- 
optimal transition for seeded matrices is not exactly equal to 
the one for random matrices, the two are, however, so close 
that the difference is not distinguishable in Fig. |3] 

B. Noisy case 

We investigate now the robustness of our results to 
measurement-matrix noise. We denote by p the probability that 
a matrix element Fai = 1 was in fact — 0, i.e. item i did 




Fig. 3. Lines are for random pooling design. From top: performance limit 
cif J for £i, performance limit agp for BP, first moment upper bound on the 
Bayes-optimal inference, Bayes-optimal inference transition, and information 
theoretic lower bound. Number of items is Af = 10^, each item is in L = 
7 pools. Data points mark the best performance we achieved with seeded 
matrices with B = 20 and w = 2, other parameters are listed in Table ^ 
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Fig. 4. Left: Error (fraction of wrongly-reconstructed items) as function 
of noise strength p for BP with random matrices and seeded matrices with 
N = 5x lO'*, p = 0.1, L = 7, and different a values, each point is averaged 
over 20 instances. Seeded matrices have B = 10 blocks, parameters are listed 
Table |l] Right: Spinodal, static transition and critical values for seeded 



blocks, and parameters are listed in Table |T] 



not contribute to the result of the test a. With nonzero values 
of p and large system size — > oo exact reconstruction is 
never possible, there is always a nonzero probability that a 
faulty item was never included in any measurement. However, 
for realistic sizes and small values of p we may still obtain 
exact or close-to-exact reconstruction. 

In the left part of Fig. |4]we plot the average error (fraction 
of wrongly reconstructed items) as a function of the noise 
strength p. We see that there is a critical value of p above 
which the performance deteriorates significantly. This value 
is larger for the seeded pooling design than for the random 
pooling design. In the right part of Fig. |4] we then plot the 
critical values of ratio a as a function of noise strength p for 
the random pooling design asp, for the best seeded pooling 
design ascod that we found with realistic parameters, and 
for the Bayes optimal reconstruction ac- For a > 0.35 (for 
p = 0.1, L = 7) there is no longer a value of p where 
the performance deteriorates sharply, instead the transition is 
smooth. Such a phase diagram is qualitatively similar to the 
one of compressed sensing with other types of noises, see 

e.g. m, ED, El. 



TABLE I 



Parameters of seeded matrices in Fig. |3 



p 


Ks 


Kf 


J 




P 


Ks 




J 


0.1 


20 


39 


0.1 




0.2 


15 




27 


0.1 


0.3 


13 


22 


0.1 




0.4 


11 




21 


0.1 




Parameters of seeded matrices in 


Fig 




left 


a 


Ks 




,; 




a 


Ks 


^.f 


J 


0.226 


20 


33 


0.1 




0.279 


19 


26 


0.1 




Parameters of seeded matrices in 


Fig. 
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right 


P 


Ks 


Kf 


J 




P 


K^ 




Kf 


J 





20 


39 


0.1 




0.01 


18 




36 


0.2 


0.02 


16 


34 


0.3 




0.03 


19 




31 


0.3 


0.04 


18 


29 


0.3 




0.05 


24 




27 


0.3 


0.06 


15 


27 


0.4 




0.07 


14 




26 


0.4 


0.08 


19 


24 


0.3 




0.09 


19 




23 


0.4 


0.1 


20 


22 


0.4 




0.11 


20 




21 


0.4 


0.12 


20 


20 


0.4 















VII. Conclusion 

We have studied non-adaptive pooling strategies for de- 
tection of rare faulty items. We have shown that the belief- 
propagation reconstruction algorithm, together with a seeded 
(spatially-coupled) design of the pools, leads to the best-known 
performance so far in the sense that it minimizes the number 
of measurements necessary for exact reconstruction in the 
noiseless case. Our results are very close to Bayes optimality 
and robust with respect to measurement noise corresponding 
to a faulty knowledge of the pools. 

It is quite possible that this pooling design and its recon- 
struction algorithm will find applications in genetic screening. 
We note that our work can be extended to the case when 
the non-zeros items in the signal are real-valued. In this case 
the BP algorithm needs to be replaced by an AMP type of 
algorithm [33|. We are currently investigating this case for 
sparse measurement matrices. 
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